Parallel  Matlab:  The  Next  Generation 

Jeremy  Kepner*  (kepner@ll.mit.edu)  and  Nadya  Travinin  (nt@ll.mit.edu) 
MIT  Lincoln  Laboratory,  Lexington,  MA  02420 


Abstract 

The true costs of high performance computing are
currently dominated by software. Addressing these
costs requires shifting to high productivity languages
such as Matlab. The development of MatlabMPI
(www.ll.mit.edu/MatlabMPI) was an important first step
that has brought parallel messaging capabilities to the
Matlab environment, and is now widely used in the community.
The ultimate goal is to move beyond basic messaging
(and its inherent programming complexity) towards
higher level parallel data structures and functions.
The pMatlab Parallel Toolbox provides these capabilities,
and allows any Matlab user to parallelize their program
by simply changing a few characters in their program.
The performance has been tested on both shared
and distributed memory parallel computers (e.g. Sun,
SGI, HP, IBM, Linux and MacOSX) on a variety of applications.

1  Introduction 

MATLAB(R) is the dominant interpreted programming
language for implementing numerical computations and
is widely used for algorithm development, simulation,
data reduction, testing and system evaluation. The popularity
of Matlab is driven by the high productivity that
users achieve: one line of Matlab code can
typically replace ten lines of C or Fortran code. Many
Matlab programs can benefit from faster execution on
a parallel computer, but achieving this goal has been a
significant challenge (see [2] for a review). MatlabMPI
[3, 4, 5] has brought parallel messaging capabilities to
hundreds of Matlab users and is being installed in several
HPC centers.

The ultimate goal is to move beyond basic messaging
(and its inherent programming complexity) towards
higher level parallel data structures and functions.
pMatlab achieves this by combining operator overloading
(first demonstrated in Matlab*P) with parallel maps (first
demonstrated in Lincoln’s Parallel Vector Library, PVL)
to provide implicit data parallelism and task parallelism.

In addition, pMatlab is built on top of MatlabMPI and
is a “pure” Matlab implementation which runs anywhere
Matlab runs, and on any heterogeneous combination of
computers. pMatlab allows a Matlab user to parallelize
their program by changing a few lines. For example, the
following program is a parallel implementation of a classic
“corner turn” type of calculation commonly used in
signal processing:

  pMATLAB_Init; Ncpus = comm_vars.comm_size;       % Initialize.
  mapX = map([1 Ncpus/2], {}, [1:Ncpus/2]);        % Map X.
  mapY = map([Ncpus/2 1], {}, [Ncpus/2+1:Ncpus]);  % Map Y.
  X = complex(rand(N,M,mapX), rand(N,M,mapX));     % Create X.
  Y = complex(zeros(N,M,mapY));                    % Create Y.
  coefs = ...                                      % Local matrix of coefs.
  weights = ...                                    % Local matrix of weights.
  Y(:,:) = conv2(coefs, X);                        % Parallel filter + corner turn.
  Y(:,:) = weights*Y;                              % Parallel matrix multiply.
  pMATLAB_Finalize; exit;                          % Finalize pMATLAB and exit.

The above example illustrates several powerful features
of pMatlab: independence of computation and parallel
mapping, “automatic” parallel computation, and data
redistribution via operator overloading.
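
To make the “corner turn” concrete, here is a minimal serial sketch in plain Matlab (no pMatlab calls) of the data movement it implies: a matrix split by columns across one set of processors is re-split by rows across another. The cell arrays standing in for per-processor memory, and the names Np, colBlocks and rowBlocks, are assumptions made for this illustration and are not part of pMatlab.

```matlab
% Serial illustration of a corner turn: column-block to row-block layout.
% Cell arrays stand in for the local memory of each processor (assumption).
N = 4; M = 6; Np = 2;              % Matrix size and processors per map.
A = reshape(1:N*M, N, M);          % Global matrix, for reference only.

% Source layout: 1 x Np grid, so each processor owns a block of columns.
colBlocks = cell(1, Np);
for p = 1:Np
  cols = (p-1)*M/Np + (1:M/Np);    % Columns owned by processor p.
  colBlocks{p} = A(:, cols);
end

% Corner turn: each destination processor gathers its rows from every
% source block (in a real system these pieces would be messages).
rowBlocks = cell(Np, 1);
for q = 1:Np
  rows = (q-1)*N/Np + (1:N/Np);    % Rows owned by destination processor q.
  pieces = cell(1, Np);
  for p = 1:Np
    pieces{p} = colBlocks{p}(rows, :);
  end
  rowBlocks{q} = [pieces{:}];      % Concatenate pieces along columns.
end

B = vertcat(rowBlocks{:});         % Reassemble to check nothing was lost.
assert(isequal(A, B));
```

Every source block contributes a piece to every destination block, which is why a corner turn is an all-to-all communication pattern.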


2 Performance Results

The vast majority of potential Matlab applications are
“embarrassingly” parallel and require minimal performance
out of the communication capabilities in pMatlab.
These applications exploit coarse grain parallelism
and communicate rarely. Figure 1 shows the speedup
obtained on a typical parallel clutter simulation.
Nevertheless, measuring the communication performance is
useful for determining which applications are most suitable
for pMatlab. pMatlab has been run on several platforms.
It has been benchmarked and compared to the
performance of the underlying MatlabMPI upon which it
is built. These results indicate that the overhead of pMatlab
is minimal (see Figure 2); the primary difference is
in the latency: 70 milliseconds for pMatlab compared to
35 milliseconds for MatlabMPI. Both pMatlab and MatlabMPI
match the performance of native C MPI [1] for
very large messages.
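
The role of latency can be illustrated with the usual first-order cost model, in which sending a message of S bytes takes roughly latency + S / peak bandwidth. The sketch below uses the two latencies reported above; the peak bandwidth of 100 MB/s is a placeholder assumption chosen only for illustration, not a measured value.

```matlab
% First-order message cost model: time = latency + size / peak bandwidth.
lat_pMatlab   = 70e-3;             % Latency in seconds (from the text).
lat_MatlabMPI = 35e-3;             % Latency in seconds (from the text).
peak_bw       = 100e6;             % Bytes/second (hypothetical value).

msg_bytes = 2.^(10:2:30);          % Message sizes from 1 KB to 1 GB.

% Effective bandwidth seen by the user at each message size.
bw_pMatlab   = msg_bytes ./ (lat_pMatlab   + msg_bytes/peak_bw);
bw_MatlabMPI = msg_bytes ./ (lat_MatlabMPI + msg_bytes/peak_bw);

% Ratio approaches 1 as messages grow and latency is amortized.
ratio = bw_pMatlab ./ bw_MatlabMPI;
fprintf('%12d bytes: pMatlab/MatlabMPI bandwidth ratio = %.3f\n', ...
        [msg_bytes; ratio]);
```

Under this model the two libraries differ by about a factor of two for small messages and converge for large ones, which is the qualitative behaviour reported in the text.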

These results indicate that it is possible to write effective
parallel programs in Matlab with minimal modifications
to the serial Matlab code. In addition, these capabilities
can be provided in a library that is written entirely
in Matlab. Ultimately, it is our goal to establish a unified
interface for parallel Matlab that a broad community supports.
We are actively collaborating with Ohio State, UC
Santa Barbara and the MIT Laboratory for Computer Science
to provide a single Unified Parallel Matlab interface
that is supported by multiple underlying implementations
(e.g. pMatlab and Matlab*P).

Figure 1: Clutter Simulation Speedup. Parallel performance
speedup of a radar clutter simulation on a cluster
of workstations.


Figure 2: pMatlab vs. MatlabMPI Bandwidth. Communication
performance on a “Ping Pong” benchmark as
a function of message size on a Linux cluster. pMatlab
equals underlying MatlabMPI performance at large
message sizes. Primary difference is latency (70 vs. 35
milliseconds).
[Figure: axes labeled “Cost” and “Lines of DoD code”]

• DoD has a clear need to rapidly develop, test and deploy
  new techniques for analyzing sensor data
  - Most DoD algorithm development and simulations are
    done in Matlab
  - Sensor analysis systems are implemented in other
    languages
  - Transformation involves years of software development,
    testing and system integration

• MatlabMPI allows any Matlab program to
  become a high performance parallel program

Challenges: Why Has This Been Hard?

• Productivity

-  Most users will not touch any solution that requires
other languages (even cmex)

•  Portability 

-  Most  users  will  not  use  a  solution  that  could  potentially  make 
their  code  non-portable  in  the  future 

•  Performance 

-  Most  users  want  to  do  very  simple  parallelism 

-  Most  programs  have  long  latencies  (do  not  require  low 
latency  solutions) 

MatlabMPI & pMatlab Software Layers

[Figure: software layer diagram. Application: Input, Analysis, Output. Library Layer (pMatlab): Vector/Matrix, Comp, Task, Conduit. Kernel Layer: Messaging (MatlabMPI), Math (Matlab). Interfaces: User Interface, Hardware Interface. Bottom: Parallel Hardware.]

• Can build a parallel library with a few messaging primitives
• MatlabMPI provides this messaging capability:

  MPI_Send(dest, tag, comm, X);
  X = MPI_Recv(source, tag, comm);

• Can build an application with a few parallel structures and functions
• pMatlab provides parallel arrays and functions:

  X = ones(n, mapX);
  Y = zeros(n, mapY);
  Y(:,:) = fft(X);

MatlabMPI functionality

• “Core Lite”: parallel computing requires eight capabilities
  - MPI_Run launches a Matlab script on multiple processors
  - MPI_Comm_size returns the number of processors
  - MPI_Comm_rank returns the id of each processor
  - MPI_Send sends Matlab variable(s) to another processor
  - MPI_Recv receives Matlab variable(s) from another processor
  - MPI_Init called at beginning of program
  - MPI_Finalize called at end of program

• Additional convenience functions
  - MPI_Abort kills all jobs
  - MPI_Bcast broadcasts a message
  - MPI_Probe returns a list of all incoming messages
  - MPI_cc passes program through Matlab compiler
  - MatMPI_Delete_all cleans up all files after a run
  - MatMPI_Save_messages toggles deletion of messages
  - MatMPI_Comm_settings user can set MatlabMPI internals

MatlabMPI: Point-to-point Communication

• Any messaging system can be implemented using file I/O
• File I/O provided by Matlab via load and save functions
  - Takes care of complicated buffer packing/unpacking problem
  - Allows basic functions to be implemented in ~250 lines of Matlab code

• Sender saves variable in Data file, then creates Lock file
• Receiver detects Lock file, then loads Data file
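
The two steps above can be sketched in a few lines of plain Matlab. This illustrates the idea only and is not the MatlabMPI source: the file naming scheme, the use of tempdir as the shared directory, and the polling interval are all assumptions made for this sketch, and both sides run in a single session here for simplicity.

```matlab
% Sketch of file-based point-to-point messaging (not the MatlabMPI source).
% File names and the polling interval are assumptions for illustration.
comm_dir  = tempdir;                                 % Shared directory.
data_file = fullfile(comm_dir, 'msg_from_0_to_1_tag_1.mat');
lock_file = fullfile(comm_dir, 'msg_from_0_to_1_tag_1.lock');

% --- Sender side ---
data = 1:10;                       % Variable to send.
save(data_file, 'data');           % 1. Write the data file completely.
fclose(fopen(lock_file, 'w'));     % 2. Then create the empty lock file.

% --- Receiver side ---
while ~exist(lock_file, 'file')    % 3. Poll until the lock file appears,
  pause(0.01);                     %    which means the data is complete.
end
msg = load(data_file);             % 4. Load the data file into a struct.
received = msg.data;

delete(data_file);                 % Clean up the message files.
delete(lock_file);
```

Creating the lock file only after save has returned is what lets the receiver assume the data file is complete as soon as it sees the lock.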

Example: Basic Send and Receive

Steps: initialize; get processor ranks; execute send; execute receive; finalize; exit.

  MPI_Init;                                % Initialize MPI.
  comm = MPI_COMM_WORLD;                   % Create communicator.
  comm_size = MPI_Comm_size(comm);         % Get size.
  my_rank = MPI_Comm_rank(comm);           % Get rank.
  source = 0;                              % Set source.
  dest = 1;                                % Set destination.
  tag = 1;                                 % Set message tag.
  if (comm_size == 2)                      % Check size.
    if (my_rank == source)                 % If source.
      data = 1:10;                         % Create data.
      MPI_Send(dest, tag, comm, data);     % Send data.
    end
    if (my_rank == dest)                   % If destination.
      data = MPI_Recv(source, tag, comm);  % Receive data.
    end
  end
  MPI_Finalize;                            % Finalize Matlab MPI.
  exit;                                    % Exit Matlab.



•  Uses  standard  message  passing  techniques 

•  Will  run  anywhere  Matlab  runs 

•  Only  requires  a  common  file  system 

pMatlab Goals

• Allow a Matlab user to write parallel programs with the least
  possible modification to their existing Matlab programs
• New parallel concepts should be intuitive to Matlab users
  - parallel matrices and functions instead of message passing
  - Matlab*P interface
• Support the types of parallelism we see in our applications
  - data parallelism (distributed matrices)
  - task parallelism (distributed functions)
  - pipeline parallelism (conduits)
• Provide a single API that potentially a wide number of organizations
  could implement (e.g. Mathworks or others)
  - unified syntax on all platforms
• Provide a unified API that can be implemented in multiple ways
  - Matlab*P implementation
  - Multimatlab
  - matlab-all-the-way-down implementation
  - unified hybrid implementation (desired)


pMatlab Library Functionality

• “Core Lite”: provides distributed array storage class (up to 4D)
  - Supports reference and assignment on a variety of distributions:
    Block, Cyclic, Block-Cyclic, Block-Overlap
  Status: Available

• “Core”: overloads most array math functions
  - good parallel implementations for certain mappings
  Status: In Development

• “Core Plus”: overloads entire Matlab library
  - Supports distributed cell arrays
  - Provides best performance for every mapping
  Status: Research
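
As a rough illustration of the distributions named above, the following plain Matlab sketch computes which processor owns each index of a one-dimensional array under block, cyclic and block-cyclic layouts. It does not use pMatlab; the variable names and the block size b are assumptions for this example, and ranks are numbered from 0. Block-Overlap, in which neighbouring blocks are understood to share boundary elements, is omitted.

```matlab
% Owner of each of N indices among Np processors (ranks 0..Np-1).
N = 12; Np = 3; b = 2;             % b = block size for block-cyclic.
idx = 0:N-1;                       % Zero-based global indices.

% Block: contiguous chunks of ceil(N/Np) elements per processor.
owner_block = floor(idx / ceil(N/Np));

% Cyclic: elements dealt out one at a time, round-robin.
owner_cyclic = mod(idx, Np);

% Block-cyclic: blocks of b elements dealt out round-robin.
owner_block_cyclic = mod(floor(idx / b), Np);

disp([owner_block; owner_cyclic; owner_block_cyclic]);
```

Block layouts keep neighbouring elements together, while cyclic layouts spread neighbouring elements across processors; block-cyclic sits between the two.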

MatlabMPI vs MPI bandwidth

• Bandwidth matches native C MPI at large message size

• Primary difference is latency (35 milliseconds vs. 30 microseconds)

MatlabMPI bandwidth scalability

Linux w/Gigabit Ethernet

• Bandwidth scales to multiple processors

•  Cross  mounting  eliminates  bottlenecks 

MatlabMPI on WindowsXP

[Screenshot: MatlabMPI running in a MATLAB 6.5 window on Windows XP]

MatlabMPI Image Filtering Performance

[Figure: speedup plot]

• Achieved “classic” super-linear speedup on fixed problem

• Achieved speedup of ~300 on 304 processors on scaled problem
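
For reference, speedup and parallel efficiency are related by efficiency = speedup / number of processors. The snippet below applies this standard definition to the scaled-problem figure quoted above; treating “~300” as exactly 300 is an approximation made for this example.

```matlab
% Parallel efficiency = speedup / number of processors.
speedup    = 300;                  % Approximate value quoted above.
processors = 304;
efficiency = speedup / processors; % Fraction of ideal linear speedup.
fprintf('Efficiency: %.1f%%\n', 100*efficiency);
```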

“Cognitive” Algorithms

• Challenge: applications requiring vast data; real-time; large memory

• Approach: test parallel processing feasibility using MatlabMPI software

• Results: algorithms rich in parallelism; significant acceleration achieved with
  minimal (100x less) programmer effort

Current MatlabMPI deployment

Lincoln Signal processing (7.8 on 8 cpus, 9.4 on 8 duals)
Lincoln Radar simulation (7.5 on 8 cpus, 11.5 on 8 duals)
Lincoln Hyperspectral Imaging (~3 on 3 cpus)

MIT LCS Beowulf (11 Gflops on 9 duals)
MIT AI Lab Machine Vision
OSU EM Simulations
ARL SAR Image Enhancement
Wash U Hearing Aid Simulations
So. Ill. Benchmarking
JHU Digital Beamforming
ISL Radar simulation
URI Heart modeling

Rapidly growing MatlabMPI user base
Web release creating hundreds of users
http://www.ll.mit.edu/MatlabMPI

pMatlab vs. MatlabMPI bandwidth

Linux Cluster


•  Bandwidth  matches  underlying  MatlabMPI 

•  Primary  difference  is  latency  (35  milliseconds  vs.  70  milliseconds) 

Clutter Simulation Performance

[Figure: speedup versus Number of Processors, Fixed Problem Size (Linux Cluster)]

  % Initialize.
  pMATLAB_Init; Ncpus = comm_vars.comm_size;

  % Map X to first half and Y to second half.
  mapX = map([1 Ncpus/2], {}, [1:Ncpus/2]);
  mapY = map([Ncpus/2 1], {}, [Ncpus/2+1:Ncpus]);

  % Create arrays.
  X = complex(rand(N,M,mapX), rand(N,M,mapX));
  Y = complex(zeros(N,M,mapY));

  % Initialize coefficients.
  coefs = ...
  weights = ...

  % Parallel filter + corner turn.
  Y(:,:) = conv2(coefs, X);

  % Parallel matrix multiply.
  Y(:,:) = weights*Y;

  % Finalize pMATLAB and exit.
  pMATLAB_Finalize; exit;


•  Achieved  “classic”  super-linear  speedup  on  fixed  problem 

•  Serial  and  Parallel  code  “identical” 

Matlab Map Code

  map3 = map([2 1], {}, 0:1);
  map2 = map([1 2], {}, 2:3);
  map1 = map([2 1], {}, 4:5);
  map0 = map([1 2], {}, 6:7);


•  Goal:  create  simulated  data  and  use  to  test  signal  processing 

•  parallelize  all  stages;  requires  3  “corner  turns” 

•  pMatlab  allows  serial  and  parallel  code  to  be  nearly  identical 

•  Easy  to  change  parallel  mapping;  set  map=1  to  get  serial  code 


pMatlab Code

  pMATLAB_Init; SetParameters; SetMaps;        % Initialize.

  Xrand = 0.01*squeeze(complex(rand(Ns,Nb,map0), rand(Ns,Nb,map0)));
  X0 = squeeze(complex(zeros(Ns,Nb,map0)));
  X1 = squeeze(complex(zeros(Ns,Nb,map1)));
  X2 = squeeze(complex(zeros(Ns,Nc,map2)));
  X3 = squeeze(complex(zeros(Ns,Nc,map3)));
  X4 = squeeze(complex(zeros(Ns,Nb,map3)));

  for i_time=1:NUM_TIME                        % Loop over time steps.
    X0(:,:) = Xrand;                           % Initialize data.
    for i_target=1:NUM_TARGETS
      [i_s i_c] = targets(i_time,i_target,:);
      X0(i_s,i_c) = 1;                         % Insert targets.
    end
    X1(:,:) = conv2(X0,pulse_shape,'same');    % Convolve and corner turn.
    X2(:,:) = X1*steering_vectors;             % Channelize and corner turn.
    X3(:,:) = conv2(X2,kernel,'same');         % Pulse compress and corner turn.
    X4(:,:) = X3*steering_vectors';            % Beamform.
    [i_range,i_beam] = find(abs(X4) > DET);    % Detect targets.
  end

  pMATLAB_Finalize;                            % Finalize.

[Legend: Implicitly Parallel Code, Required Change]

Outline

• Introduction
• Approach
• Performance Results
• Future Work and Summary

[Figure: axis “Development Time (Lines of Code)”; annotation “Same programmer”]

pMatlab achieves high performance with very little effort

Airborne Sensor “QuickLook” Capability

pMatlab Future Work

1. Demonstrate in a large multi-stage framework
2. Incorporate Expert Knowledge into Standard Components
3. Port pMatlab to HPEC systems

Summary

• MatlabMPI has the basic functions necessary for parallel
  programming
  - Size, rank, send, receive, launch
  - Enables complex applications or libraries
• Performance can match native MPI at large message sizes
• Demonstrated scaling into hundreds of processors
• pMatlab allows users to write very complex parallel codes
  - Built on top of MatlabMPI
  - Pure Matlab (runs everywhere Matlab runs)
  - Performance comparable to MatlabMPI
• Working with MIT LCS, Ohio St. and UCSB to define a unified
  parallel Matlab interface

Web Links


MatlabMPI 

http://www.ll.mit.edu/MatlabMPI 

High  Performance  Embedded 
Computing  Workshop 
http://www.ll.mit.edu/HPEC 